Extend Nemo AutoTokenizer & SentencePieceTokenizer API for TensorRT-LLM & AMMO evaluation scripts usage #8818
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
LGTM, but can we check whether the AutoTokenizer and local SentencePiece outputs match for the newly implemented methods before we merge? It is probably fine if they're not identical, but we need to know where the differences arise and document any that exist.
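One way to run the check requested above is a small helper that feeds the same inputs to both tokenizers and records where the token IDs diverge. The sketch below is illustrative only: `compare_encodings` and the two toy encoders are hypothetical stand-ins, not code from this PR. In practice the two callables would be the `encode` methods of the AutoTokenizer wrapper and of SentencePieceTokenizer. The leading BOS id in the second toy encoder is an assumed example of a difference, not a measured one.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Encoder = Callable[[str], List[int]]


def compare_encodings(
    encode_a: Encoder, encode_b: Encoder, texts: Sequence[str]
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return {text: (ids_a, ids_b)} for every text whose encodings differ."""
    mismatches = {}
    for text in texts:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches[text] = (ids_a, ids_b)
    return mismatches


# Hypothetical toy encoders standing in for the two real tokenizers.
TOY_VOCAB = {"hello": 2, "world": 3}


def encode_plain(text: str) -> List[int]:
    # Whitespace split plus vocabulary lookup, no special tokens.
    return [TOY_VOCAB[token] for token in text.split()]


def encode_with_bos(text: str) -> List[int]:
    # Same lookup, but prepends an assumed BOS id of 1.
    return [1] + encode_plain(text)


if __name__ == "__main__":
    report = compare_encodings(encode_plain, encode_with_bos, ["hello world"])
    for text, (ids_a, ids_b) in report.items():
        print(f"{text!r}: {ids_a} vs {ids_b}")
```

An empty result means the two encoders agree on every input. Any entries that remain are the differences that would need to be documented.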
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale. |
What does this PR do?
Extends NeMo's AutoTokenizer and SentencePieceTokenizer for consistency with the HuggingFace tokenizers API, which is used throughout the TensorRT-LLM and AMMO evaluation tools.
This is necessary to seamlessly enable model validation -- including quantized variants -- with the current scripts and to avoid code duplication.
Example use-cases to support:
encode: link1 and link2
decode: link1
batch_decode: link1
batch_encode_plus: link2
pad_token_id: link1
Collection: NLP
Changelog
pad_token_id (with setter enabled for HF AutoTokenizer wrapper) and eos_token_id property attributes for HF tokenizer
Usage
See examples listed above.
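As a rough illustration of the HuggingFace-style surface that the use-cases and changelog refer to, the sketch below implements the same method and property names on a toy whitespace tokenizer. `ToyTokenizer` is hypothetical and is not NeMo's implementation. Its return types are simplified assumptions (plain lists and a dict rather than HuggingFace's own encoding objects), and it exists only to show how evaluation scripts might call `encode`, `decode`, `batch_decode`, `batch_encode_plus`, `pad_token_id`, and `eos_token_id`.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a wrapped tokenizer; not NeMo's implementation."""

    def __init__(self, vocab: List[str], pad_token: str = "<pad>", eos_token: str = "</s>"):
        # Ids 0 and 1 are reserved for the pad and EOS tokens in this toy.
        self._id_to_token = [pad_token, eos_token] + list(vocab)
        self._token_to_id = {tok: i for i, tok in enumerate(self._id_to_token)}
        self._pad_token_id = self._token_to_id[pad_token]
        self._eos_token_id = self._token_to_id[eos_token]

    @property
    def pad_token_id(self) -> int:
        return self._pad_token_id

    @pad_token_id.setter
    def pad_token_id(self, value: int) -> None:
        # The setter lets a caller reassign the pad id, e.g. to the EOS id.
        self._pad_token_id = value

    @property
    def eos_token_id(self) -> int:
        return self._eos_token_id

    def encode(self, text: str) -> List[int]:
        # Whitespace split; unknown tokens raise KeyError in this toy.
        return [self._token_to_id[tok] for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self._id_to_token[i] for i in ids)

    def batch_decode(self, batch_ids: List[List[int]]) -> List[str]:
        return [self.decode(ids) for ids in batch_ids]

    def batch_encode_plus(self, texts: List[str]) -> Dict[str, List[List[int]]]:
        # Right-pad every sequence to the longest one in the batch.
        encoded = [self.encode(t) for t in texts]
        width = max((len(ids) for ids in encoded), default=0)
        input_ids = [ids + [self.pad_token_id] * (width - len(ids)) for ids in encoded]
        attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in encoded]
        return {"input_ids": input_ids, "attention_mask": attention_mask}


if __name__ == "__main__":
    tok = ToyTokenizer(["hello", "world"])
    print(tok.encode("hello world"))        # [2, 3]
    tok.pad_token_id = tok.eos_token_id     # uses the setter
    print(tok.batch_encode_plus(["hello world", "hello"])["input_ids"])  # [[2, 3], [2, 1]]
```

Making `pad_token_id` settable while `eos_token_id` stays read-only mirrors the changelog entry above. The rest of the class is illustrative scaffolding.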
Jenkins CI
To run Jenkins, a NeMo User with write access must comment "jenkins" on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.